Keyword spotting in unconstrained handwritten Chinese documents using contextual word model

Authors

  • Liang Huang
  • Fei Yin
  • Qing-Hu Chen
  • Cheng-Lin Liu
Abstract

Keywords: Keyword spotting; Chinese handwritten documents; Word similarity; Contextual word model

This paper proposes a method for keyword spotting in off-line Chinese handwritten documents using a contextual word model, which measures the similarity between the query word and every candidate word in the document by combining a character classifier with geometric and linguistic context. The geometric context model characterizes single-character likeliness and between-character relationships. The linguistic model exploits the dependency of the word on its external adjacent characters. The combining weights are optimized on training documents. Experiments on the large handwriting database CASIA-HWDB demonstrate the effectiveness of the proposed method and justify the benefits of geometric and linguistic contexts. Compared to transcription-based text search, the proposed method can provide a higher recall rate, and for spotting words of four characters it provides both higher precision and higher recall.

Due to the huge volume of existing documents and the ever-increasing number of new documents in daily life, the need for efficient document retrieval techniques is prominent. Once documents are scanned into digital images, character recognition and retrieval techniques can help sort documents efficiently, summarize them, find documents, and locate regions of interest. Among document retrieval techniques [1,2], keyword spotting finds relevant documents containing the queried words and locates the word instances for further investigation. For degraded printed documents and for handwritten documents, keyword spotting is still an unsolved problem due to the difficulty of word segmentation and recognition. Traditional character and word recognition techniques do not give sufficiently high accuracy on these documents, so text search based on transcription (text line recognition) does not perform satisfactorily. For example, a state-of-the-art handwriting recognizer reports word recognition rates of 79.7% on online data and 74.1% on off-line data [3].

Compared to transcription-based text search, keyword spotting has the advantage that information related to the query word can be fully exploited to improve the recall rate of retrieval. This is particularly important for less frequently used words, which are usually recognized less accurately by traditional recognizers. As a two-class (binary) problem, keyword spotting can easily reach different points on the precision–recall tradeoff by varying the decision threshold. Transcription-based search is less flexible in this respect, particularly when the recognition accuracy is low. The recall rate of transcription-based search can be improved by giving multiple …
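The two ideas above (a word similarity formed by weighting several component scores, and a precision-recall tradeoff obtained by moving a decision threshold) can be illustrated with a minimal Python sketch. This is not the paper's formulation: the weighted-sum form, the function names `word_similarity` and `precision_recall`, and all numbers below are assumptions made purely for illustration.

```python
from typing import Sequence, Tuple


def word_similarity(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine component scores into one similarity value.

    Assumption: a plain weighted sum of, e.g., classifier, geometric and
    linguistic scores. The paper's actual combination is not given here.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def precision_recall(
    similarities: Sequence[float],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall when candidates with similarity >= threshold are accepted.

    labels[i] is 1 if candidate i is a true instance of the query word, else 0.
    """
    accepted = [lab for sim, lab in zip(similarities, labels) if sim >= threshold]
    true_pos = sum(accepted)
    total_pos = sum(labels)
    # Convention chosen for this sketch: precision is 1.0 when nothing is accepted.
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical similarities and ground-truth labels for four candidates.
    sims = [0.9, 0.8, 0.4, 0.2]
    truth = [1, 0, 1, 0]
    for t in (0.85, 0.5, 0.3):
        p, r = precision_recall(sims, truth, t)
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold accepts more candidates, which in this toy data raises recall; a transcription-based search, by contrast, commits to a single recognized string and has no comparable knob.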


Similar resources

Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents

In keyword spotting from handwritten documents by text query, the word similarity is usually computed by combining character similarities, which are desired to approximate the logarithm of the character probabilities. In this paper, we propose to directly estimate the posterior probability (also called confidence) of candidate characters based on the N-best paths from the candidate segmentation...


Keyword Spotting from Online Chinese Handwritten Documents using One-versus-All Character Classification Model

In this paper, we propose a method for text-query-based keyword spotting from online Chinese handwritten documents using character classification model. The similarity between the query word and handwriting is obtained by combining the character classification scores. The classifier is trained by one-versus-all strategy so that it gives high similarity to the target class and low scores to the oth...


Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting is to make searchable unindexed image documents by locating word/words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...


Zone-based Keyword Spotting in Bangla and Devanagari Documents

In this paper we present a word spotting system in text lines for offline Indic scripts such as Bangla (Bengali) and Devanagari. Recently, it was shown that zone-wise recognition method improves the word recognition performance than conventional full word recognition system in Indic scripts [29]. Inspired with this idea we consider the zone segmentation approach and use middle zone information ...


A probabilistic method for keyword retrieval in handwritten document images

Keyword retrieval in handwritten document images (word spotting) is very challenging given that OCR accuracy is not yet adequate for handwritten scripts, specially with large lexicons. Various proposed approaches build indices on information such as image features or OCR scores and have improved the performance of the traditional approach that builds index on OCR’ed text. In this paper, we impr...



Journal:
  • Image Vision Comput.

Volume 31

Publication date: 2013